Distributed Representations for Biological Sequence Analysis
Authors
Abstract
Biological sequence comparison is a key step in inferring the relatedness of various organisms and the functional similarity of their components. Thanks to Next Generation Sequencing efforts, an abundance of sequence data is now available to be processed for a range of bioinformatics applications. Embedding a biological sequence (over a nucleotide or amino acid alphabet) in a lower-dimensional vector space makes the data more amenable to current machine learning tools, provided that the embedding is of high quality and captures the most meaningful information in the original sequences. Motivated by recent advances in the text document embedding literature, we present a new method, called seq2vec, to represent a complete biological sequence in a Euclidean space. The new representation has the potential to capture the contextual information of the original sequence that is necessary for sequence comparison tasks. We test our embeddings on protein sequence classification and retrieval tasks and demonstrate encouraging outcomes.
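To make the idea of mapping a sequence into a vector space concrete, the following minimal Python sketch represents a protein sequence by counts of its overlapping k-mers and compares two sequences by cosine similarity. This is not the seq2vec method, which learns its embedding: the abstract does not specify tokenization or training details, so the k-mer "words", the choice k = 3, and the helper names `kmer_tokens`, `kmer_counts`, and `cosine_similarity` are all assumptions made purely for illustration.

```python
from collections import Counter
from math import sqrt


def kmer_tokens(sequence, k=3):
    """Split a sequence into overlapping k-mers (hypothetical helper).

    Assumption: k-mers act as the "words" of a sequence; the abstract
    does not state how seq2vec tokenizes its input.
    """
    sequence = sequence.upper()
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def kmer_counts(sequence, k=3):
    """Sparse count vector over k-mers: a simple stand-in for an embedding."""
    return Counter(kmer_tokens(sequence, k))


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector has no direction to compare
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    # Illustrative toy sequences, not real data.
    query = kmer_counts("MKVLAAGIV")
    candidate = kmer_counts("MKVLAAGLV")
    print(cosine_similarity(query, candidate))
```

A count-based vector like this ignores the order and context of k-mers beyond length k, which is the kind of contextual information a learned embedding is meant to capture.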
Similar resources
iProsite: an improved prosite database achieved by replacing ambiguous positions with more informative representations
The PROSITE database contains a set of entries corresponding to protein families, which are used to identify the family of a protein from its sequence. Although patterns and profiles are developed to be very selective, each may have false positive or false negative hits. Considering false positives as items that reduce the selectiveness of a pattern, then, the more selective pattern we have, a more accur...
Strong convergence for variational inequalities and equilibrium problems and representations
We introduce an implicit method for finding a common element of the set of solutions of systems of equilibrium problems and the set of common fixed points of a sequence of nonexpansive mappings and a representation of nonexpansive mappings. Then we prove the strong convergence of the proposed implicit schemes to the unique solution of a variational inequality, which is the optimality condition for ...
Biological Activity Analysis of Native and Recombinant Streptokinase Using Clot Lysis and Chromogenic Substrate Assay
Determination of streptokinase activity is usually accomplished through two assay methods: a) Clot lysis, b) Chromogenic substrate assay. In this study the biological activity of two streptokinase products, namely Streptase®, which is a native product and Heberkinasa®, which is a recombinant product, was determined against the third international reference standard using the two forementioned a...
dna2vec: Consistent vector representations of variable-length k-mers
One ubiquitous representation of a long DNA sequence divides it into shorter k-mer components. Unfortunately, the straightforward encoding of a k-mer as a one-hot vector is vulnerable to the curse of dimensionality. Worse yet, all pairs of distinct one-hot vectors are equidistant. This is particularly problematic when applying the latest machine learning algorithms to so...
Journal title: CoRR
Volume: abs/1608.05949
Publication date: 2016